Enhancing Performance in Single Processor Computing
Exploring how parallel processing techniques can be implemented in single processor systems to improve performance and efficiency
In the realm of modern computing, the pursuit of enhanced performance and efficiency has led to significant advancements in parallel processing techniques. Parallelism, a key concept in computing, involves executing multiple processes or tasks simultaneously to optimize computational speed and resource utilization.
Tasks are executed sequentially, one after the other
Multiple tasks executed simultaneously to optimize speed
Single processor with parallel capabilities
Multiple processors working together
Multiple cores on a single processor chip
System's ability to handle increasing workloads by adding more resources
Effective utilization of resources without overloading any single component
A typical single-processor computer has three major parts: a central processing unit (CPU), main memory, and input/output (I/O) devices.
Main controller with sixteen 32-bit general-purpose registers, one of which serves as the program counter
Special register holding information about the current state of the processor and the running program
Arithmetic logic unit with optional floating point accelerator and local cache memory
Operator interface connected to floppy disk
The CPU, main memory, and I/O devices all connect to a common bus called the synchronous backplane interconnect. Through this bus, all I/O devices can communicate with each other, the CPU, or memory. Peripheral storage and I/O devices can connect directly to the bus through a controller.
Parallel computing is a method by which a computer system executes multiple instructions at the same time, classically by allocating each task to a different processor. In a uniprocessor system, this capability is achieved through techniques such as using multiple cores within a single processor chip, dividing a job into smaller sub-tasks that can be processed concurrently, or leveraging specialized hardware or software to coordinate parallel processing.
Technique that allows a processor to execute a set of instructions simultaneously by dividing the instruction execution process into several stages
Method that permits a single processor to run multiple tasks at the same time by dividing the processor's time into short intervals
Pipelining allows a processor to carry out multiple instructions at the same time by dividing the execution process into several stages. Each stage in the pipeline operates on a different instruction concurrently, allowing one instruction to be fetched from memory while another is being executed.
This parallelism enhances the throughput of the processor and improves performance.
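The throughput gain from overlapping stages can be sketched with a simple cycle count. This is an idealized model that assumes a 5-stage pipeline (fetch, decode, operand fetch, execute, store) with no stalls or hazards, not a model of any real CPU:

```python
# Idealized cycle counts for sequential vs. pipelined execution.
# Assumes a 5-stage pipeline with no stalls -- an illustration only.

STAGES = 5

def sequential_cycles(n_instructions: int) -> int:
    # Without pipelining, each instruction occupies all stages in turn.
    return n_instructions * STAGES

def pipelined_cycles(n_instructions: int) -> int:
    # With pipelining, once the first instruction fills the pipe,
    # one instruction completes every cycle.
    return STAGES + (n_instructions - 1)

n = 100
print(sequential_cycles(n))  # 500
print(pipelined_cycles(n))   # 104
```

For 100 instructions the pipelined machine needs roughly a fifth of the cycles, which is where the claimed throughput improvement comes from.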
Multitasking works by dividing the processor's time into short intervals and rapidly switching between tasks. Each task is allocated a particular time slot in which to execute. Although the processor executes only one task at a time, this rapid switching creates the illusion of parallel processing.
These methods enhance the performance of a single processor. However, as the number of tasks or instructions running at the same time grows, performance eventually degrades. At that point, a multiprocessor is required to boost performance for highly parallel workloads.
Improves the performance of a uniprocessor by allowing it to execute multiple tasks or instructions simultaneously. This is achieved by increasing throughput, which reduces the time required to complete a particular task.
Parallelism in a uniprocessor is cost-effective for applications that do not require the performance of a multiprocessing system. A uniprocessor with parallelism often costs less than a multiprocessing system.
A uniprocessor consumes less power than a multiprocessor system, which makes it suitable for mobile and battery-powered devices.
Modern smartphones use single-chip processors with multiple cores (e.g., octa-core processors). These chips implement parallelism through pipelining and multitasking to provide a smooth user experience while keeping power consumption low for battery life.
6-core processor with 2 performance cores and 4 efficiency cores, using pipelining for parallel execution
8-core processor with advanced pipelining and multitasking capabilities for Android devices
Parallelism is achieved in only a limited way, and as the number of tasks or instructions executed simultaneously increases, performance decreases. This makes the approach unsuitable for applications that require high levels of parallelism.
It has limited processing power compared to a multiprocessing system, so it is not suitable for applications that require high computational power, such as scientific simulations and large-scale data processing.
Implementing parallelism in a uniprocessor can be complex, as it requires careful design and optimization to ensure that the system operates correctly and efficiently. This increases the development and maintenance costs of the system.
While gaming laptops with high-end uniprocessor chips can handle most games well, they struggle with extremely demanding tasks like real-time ray tracing or complex physics simulations at high settings. These tasks require the massive parallel processing power of dedicated GPUs or multi-processor systems.
Runs well on high-end uniprocessor systems but requires GPU acceleration for advanced ray tracing features
Complex simulations require multi-processor systems for real-time processing
It increases performance in multimedia applications such as video and audio playback, image processing, and 3D graphics rendering.
Helps web servers handle multiple requests simultaneously, which makes them more responsive and reliable.
It improves performance in artificial intelligence and machine learning applications allowing them to process large amounts of data more quickly.
Parallelism accelerates scientific simulations such as weather forecasting, fluid dynamics, and molecular modeling.
Parallelism in uniprocessors is used to improve the performance of database management systems by allowing them to handle large volumes of data more efficiently.
Applications like Adobe Premiere Pro use uniprocessor parallelism for real-time video preview and rendering
Unity and Unreal Engine utilize pipelining in uniprocessors for smooth game performance
Chrome and Firefox use multitasking to handle multiple tabs and web processes simultaneously
In earlier computers, the central processing unit (CPU) had just one arithmetic logic unit that could only carry out one function at a time. This slowed down the execution of long sequences of arithmetic instructions. To improve this, the number of functional units in the CPU was increased so that parallel and simultaneous arithmetic operations could be performed.
The CDC-6600 computer has ten different functional units built into its central processing unit:
Handles fixed-point addition operations
Handles fixed-point multiplication operations
Handles fixed-point division operations
Handles floating-point addition operations
Handles floating-point multiplication operations
Handles floating-point division operations
Handles increment operations
Handles shift operations
Handles boolean operations
Handles branch operations
These ten units work independently and can run at the same time. A scoreboard keeps track of which functional units and registers are available. With 10 functional units and 24 registers, the instruction issue rate can be greatly increased.
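The scoreboard's core job is bookkeeping: an instruction may issue only if its functional unit is free and its destination register has no pending write. A minimal sketch of that issue check, loosely modeled on the CDC-6600 (the unit and register names below are illustrative, not the machine's actual encoding):

```python
# Minimal sketch of scoreboard-style issue logic: an instruction issues
# only when its functional unit is free and its destination register has
# no pending write. Unit/register names are illustrative.

class Scoreboard:
    def __init__(self, units):
        self.units = set(units)
        self.busy_units = set()      # functional units with work in flight
        self.pending_regs = set()    # registers awaiting a result

    def can_issue(self, unit, dest_reg):
        return unit not in self.busy_units and dest_reg not in self.pending_regs

    def issue(self, unit, dest_reg):
        if not self.can_issue(unit, dest_reg):
            return False             # stall: structural or data hazard
        self.busy_units.add(unit)
        self.pending_regs.add(dest_reg)
        return True

    def complete(self, unit, dest_reg):
        self.busy_units.discard(unit)
        self.pending_regs.discard(dest_reg)

sb = Scoreboard(["float_add", "float_mul"])
print(sb.issue("float_add", "X1"))  # True: unit free, X1 not pending
print(sb.issue("float_mul", "X1"))  # False: X1 already has a pending write
print(sb.issue("float_mul", "X2"))  # True: different unit and register
```

With many independent units and registers tracked this way, several instructions can be in flight at once, which is what raises the issue rate.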
Another great example of a multifunction uniprocessor is the IBM 360/91. It has two parallel execution units: one for integer arithmetic and one for floating point arithmetic. The floating point unit has two functional units inside it - one for float add/subtract and one for float multiply/divide. The IBM 360/91 is a highly pipelined, multifunction scientific processor.
Parallel adders using techniques such as carry-lookahead and carry-save are now built into almost all arithmetic logic units, in contrast to the bit-serial adders used in early computers. Techniques like high-speed multiplier recoding and convergent division allow parallel processing and sharing of hardware components for multiply and divide operations.
The execution of instructions is now divided into multiple pipeline stages, including fetching the instruction, decoding it, fetching operands, executing the arithmetic logic, and storing the result. To allow overlapped execution of instructions through the pipeline, techniques like instruction prefetching and data buffering have been developed.
The input/output (I/O) operations can be carried out at the same time as the CPU computations through the use of separate I/O controllers, channels, or I/O processors. A direct memory access (DMA) channel enables direct transfer of information between the I/O devices and main memory. DMA operates by cycle stealing, which is transparent to the CPU. Additionally, I/O multiprocessing such as utilizing I/O processors in the CDC-6600 can accelerate data transfer between the CPU and external devices.
Allows I/O devices to transfer data directly to/from memory without CPU intervention
DMA uses bus cycles when CPU is not using them, making it transparent to CPU operations
Specialized processors that handle I/O operations independently of main CPU
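Cycle stealing can be illustrated with a toy bus model: on each bus cycle the CPU has priority, and the DMA controller moves a word only on cycles the CPU leaves idle. The CPU usage pattern below is an arbitrary illustration, not measured behavior:

```python
# Toy model of DMA cycle stealing: the DMA controller transfers a word
# only on bus cycles the CPU is not using, so the CPU is never delayed.

def run_bus(cpu_needs_bus, dma_words_pending):
    # cpu_needs_bus: per-cycle flags, True when the CPU occupies the bus
    transferred = 0
    for cpu_busy in cpu_needs_bus:
        if not cpu_busy and transferred < dma_words_pending:
            transferred += 1  # DMA "steals" this idle cycle
    return transferred

pattern = [True, False, True, True, False, False, True, False]
print(run_bus(pattern, dma_words_pending=3))  # 3 words moved, CPU undisturbed
```

Because transfers happen only on idle cycles, the stealing is transparent to the CPU, as described above.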
The CPU is far faster than main memory access. A hierarchical memory system can be used to close this speed gap.
The most internal level is the register files that can be directly accessed by the ALU. The cache memory can function as a buffer between the CPU and main memory. Block access of main memory can be accomplished through multiway interleaving across parallel memory modules.
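Low-order interleaving maps consecutive word addresses to different modules, so a block fetch can overlap the cycles of several modules. A sketch of that address mapping, assuming m = 4 modules (the module count is an assumption for illustration):

```python
# Sketch of low-order memory interleaving: with M parallel modules,
# consecutive word addresses land in different modules, so a block of
# words can be fetched with overlapped module cycles. M = 4 is assumed.

M = 4  # number of interleaved memory modules (assumed)

def module_of(addr: int) -> int:
    return addr % M          # low-order bits select the module

def offset_in_module(addr: int) -> int:
    return addr // M         # remaining bits index within the module

block = list(range(8))       # 8 consecutive word addresses
print([module_of(a) for a in block])  # [0, 1, 2, 3, 0, 1, 2, 3]
```

Since the 8-word block touches each module twice in rotation, up to 4 accesses can proceed concurrently.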
In general, the CPU is the fastest unit in the computer, with a processor cycle time tp of tens of nanoseconds. Main memory has a cycle time tm of hundreds of nanoseconds, and I/O devices are the slowest, with an average access time td of a few milliseconds. It is observed that:
td > tm > tp
For example, the IBM 370/168 has td = 8 ms, tm = 360 ns, and tp = 90 ns. With these speed gaps between the subsystems, we need to match their processing bandwidths to avoid a system bottleneck.
Number of memory words that can be accessed per unit time, where W is the number of words delivered per memory cycle: Bm = W / tm
Maximum CPU computation rate (e.g., 160 megaflops in Cray-1, 12.5 million instructions per second in IBM 370/168)
Actual performance achieved: Bu ≤ Bp
Use fast cache memory between CPU and main memory with access time similar to CPU
Acts as data/instruction buffer, transferring blocks of memory words from main memory
Use communication channels with different speeds between slow I/O devices and main memory
I/O channels execute buffering and multiplexing functions to move data from multiple devices
Disk controllers or database machines can filter non-relevant data directly from tracks
Within a given time period, multiple processes may be running concurrently in a computer system. These processes compete for memory, input/output, and CPU resources. We know that some programs are CPU-intensive while others are I/O-intensive. We can execute a mix of program types to balance usage across different hardware components. Interleaving program execution is meant to enable better utilization through overlapping of I/O and CPU operations.
When a process P1 is occupied with I/O, the scheduler can switch the CPU to process P2. This allows multiple programs to make progress concurrently. When P2 finishes, the CPU can switch to P3. By interleaving I/O and CPU work in this way, CPU wait times are greatly reduced.
The interleaving of CPU and I/O operations across multiple programs is called multiprogramming.
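The switching policy can be sketched as a tiny scheduler: the CPU runs one CPU burst of a process, then hands the CPU to the next ready process while the first one waits on I/O. The process names and burst counts are illustrative:

```python
# Sketch of multiprogramming: after each CPU burst a process goes off to
# do I/O, and the CPU is given to the next ready process in the meantime.
# Burst counts are illustrative.

from collections import deque

def schedule(jobs):
    # jobs: name -> number of CPU bursts, separated by I/O waits
    ready = deque(jobs.items())
    trace = []
    while ready:
        name, bursts = ready.popleft()
        trace.append(name)            # run one CPU burst
        if bursts > 1:                # process departs for I/O, then rejoins
            ready.append((name, bursts - 1))
    return trace

print(schedule({"P1": 2, "P2": 2, "P3": 1}))
# ['P1', 'P2', 'P3', 'P1', 'P2'] -- P2 and P3 use the CPU while P1 waits on I/O
```

The trace shows the overlap directly: the CPU is never idle while some process still has a burst to run.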
Multiprogramming on a single processor involves the CPU being shared by many programs. Sometimes, a high priority program may occupy the CPU for a long time which prevents other programs from sharing it. This issue can be resolved through a method called timesharing.
Timesharing builds on multiprogramming by assigning fixed or variable time slots to multiple programs. This provides equal opportunities for all programs competing to use the CPU.
The timesharing use of the CPU by multiple programs on a single processor computer creates the concept of virtual processors. Each program behaves as if it has its own dedicated processor, even though they're all sharing the same physical CPU.
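Timesharing's quantum-based switching is essentially round-robin scheduling. A minimal sketch, where the quantum and the per-program workloads are illustrative values:

```python
# Sketch of timesharing: each program runs for one fixed time slice
# (quantum) and then goes to the back of the queue, so every program
# behaves as if it had its own slower "virtual processor".

from collections import deque

def round_robin(programs, quantum):
    # programs: name -> total time units of work needed
    queue = deque(programs.items())
    order = []
    while queue:
        name, remaining = queue.popleft()
        order.append(name)                   # run for one quantum
        remaining -= quantum
        if remaining > 0:
            queue.append((name, remaining))  # not done: back of the queue
    return order

print(round_robin({"A": 3, "B": 2, "C": 1}, quantum=1))
# ['A', 'B', 'C', 'A', 'B', 'A']
```

Unlike the multiprogramming policy, which switches only when a process blocks, the quantum forces a switch at fixed intervals, which is what gives every program a fair share of the CPU.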
Timesharing is especially effective for computer systems connected to many interactive terminals. Each user at a terminal can interact with the computer. Timesharing was first developed for single processor systems. It has also been extended to multi-processor systems.
Use time sharing to allow multiple users to interact with the system simultaneously
Early mainframes used time sharing to serve multiple terminal users
Modern web servers use time sharing concepts to handle multiple client requests
| Aspect | Multiprogramming | Time Sharing |
|---|---|---|
| Primary Goal | Maximize CPU utilization by overlapping I/O and CPU operations | Provide responsive interactive computing to multiple users |
| CPU Allocation | Based on I/O operations (CPU switches when process waits for I/O) | Based on fixed/variable time slices (quantum) |
| User Interaction | Not primarily designed for interactive use | Designed for interactive use with terminals |
| Response Time | May vary significantly depending on system load | More consistent response time for interactive users |
| Examples | Early batch processing systems | Unix, Multics, modern interactive systems |
Single processor systems can achieve parallelism through various hardware and software techniques
Include multiple functional units, pipelining, hierarchical memory, and overlapped I/O operations
Include multiprogramming and time sharing to maximize resource utilization
Bandwidth balancing between subsystems is crucial for optimal performance
Parallel processing techniques in uniprocessor systems have revolutionized computing by enabling significant performance improvements without the need for multiple processors. These techniques are fundamental to modern computing devices, from smartphones to supercomputers.
Smartphones, tablets, and laptops use uniprocessor parallelism for smooth user experience
Servers and workstations utilize these techniques for efficient resource management
Even in single-processor systems, parallelism enables complex calculations and simulations
As computing demands continue to grow, the principles of parallelism in uniprocessor systems remain relevant. However, for extremely high-performance requirements, multi-processor and multi-core systems become necessary. The future lies in hybrid approaches that combine the best of both worlds.
Combining uniprocessor parallelism techniques with multi-core designs
More sophisticated pipeline designs with deeper and wider stages
Specialized uniprocessor designs optimized for artificial intelligence workloads
Parallelism in uniprocessor systems demonstrates that even with a single processing unit, significant performance improvements can be achieved through clever hardware and software design. These techniques form the foundation of modern computing and continue to evolve to meet the ever-increasing demands for computational power.